Journal of Chemical Information and Modeling
● American Chemical Society (ACS)
All preprints, ranked by how well they match Journal of Chemical Information and Modeling's content profile, based on 207 papers previously published here. The average preprint has a 0.21% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Sellner, M. S.; Lill, M. A.; Smiesko, M.
The efficient and accurate prediction of protein-ligand binding affinities is an extremely appealing yet still unresolved goal in computational pharmacy. In recent years, many scientists have taken advantage of the remarkable progress of deep learning and applied it to address this issue. Despite all the advances in this field, there is increasing evidence that the typically applied validation of these methods is not suitable for medicinal chemistry applications. This work assesses the importance of dataset quality and proper dataset splitting techniques demonstrated on the example of the PDBbind dataset. We also introduce a new tool for the analysis of protein-ligand complexes, called po-sco. Po-sco allows the extraction of interaction information with much higher detail and comprehensibility than the tools available to date. We trained a transformer-based deep learning model to generate protein-ligand interaction fingerprints that can be utilized for downstream predictions, such as binding affinity. When using po-sco, this model generated predictions that were superior to those based on commonly used PLIP and ProLIF tools. We also demonstrate that the quality of the dataset is more important than the number of data points and that suboptimal dataset splitting can lead to a significant overestimation of model performance.
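The abstract does not say how the datasets were split, so the following is only a generic sketch of one way to keep related complexes out of both sets: a group-based split in which every complex sharing a group label (for example a protein cluster) lands on the same side. The function name `group_split`, the cluster labels, and the toy data are all hypothetical, not the authors' procedure.

```python
import random
from collections import defaultdict

def group_split(items, group_of, test_fraction=0.2, seed=0):
    """Split items so that no group appears in both train and test."""
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    target = test_fraction * len(items)
    train, test = [], []
    for key in keys:
        # Assign whole groups to the test set until it reaches the target size.
        if len(test) < target:
            test.extend(groups[key])
        else:
            train.extend(groups[key])
    return train, test

# Hypothetical complexes: (complex_id, protein_cluster_id)
complexes = [(f"c{i}", f"cluster{i % 5}") for i in range(20)]
train, test = group_split(complexes, group_of=lambda c: c[1])
train_groups = {c[1] for c in train}
test_groups = {c[1] for c in test}
```

A random per-complex split of the same data would usually place members of one cluster on both sides, which is the kind of leakage that can inflate reported performance.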
Surendran, A.; Zsigmond, K.; Quintana, R. A. M.
The visualization of high-dimensional chemical space is a critical tool for understanding molecular diversity and structure-property relationships, and for guiding compound selection. However, the performance of non-linear dimensionality reduction (DR) techniques such as t-distributed Stochastic Neighbor Embedding (t-SNE), Uniform Manifold Approximation and Projection (UMAP), and Generative Topographic Mapping (GTM) is often sensitive to the choice of hyperparameters, and training them on large datasets is costly. In this study, we investigated the effect of undersampling methods on hyperparameter selection for these non-linear dimensionality reduction methods. Our results demonstrate that selecting small representative subsets of chemical data not only reduces the computational costs associated with hyperparameter training but also serves as an innovative means to train non-linear DR methods, leading to projections that better preserve the local structure within the chemical space.
Lee, K. H.; Won, S. J.; Oyinloye, P.; Shi, L.
The dopamine transporter (DAT) plays a critical role in the central nervous system and has been implicated in numerous psychiatric disorders. Ligand-based approaches, especially quantitative structure-activity relationship (QSAR) modeling, are instrumental in deciphering the structure-activity relationship (SAR) of DAT ligands. By gathering and analyzing data from literature and databases, we systematically assemble a diverse range of ligands binding to DAT, aiming to discern the general features of DAT ligands and uncover the chemical space for potential novel DAT ligand scaffolds. The aggregation of DAT pharmacological activity data, particularly from databases like ChEMBL, provides a foundation for constructing robust QSAR models. The compilation and meticulous filtering of these data, establishing high-quality training datasets with specific divisions of pharmacological assays and data types, along with the application of QSAR modeling, prove to be a promising strategy for navigating the pertinent chemical space. Through a systematic comparison of DAT QSAR models using training datasets from various ChEMBL releases, we underscore the positive impact of enhanced data set quality and increased data set size on the predictive power of DAT QSAR models.
Yang, Y.-Y.; Pickersgill, R. W.; Fornili, A.
The geometric and chemical features of protein binding sites tend to change as a consequence of conformational dynamics. In the ligand-unbound (apo) state, a binding site might be only transiently organised in a way that can accommodate a given ligand, with the relevant regions of the protein coming together in a suitable arrangement only in a subset of conformations. Ligand binding itself can also induce further changes in the binding site. Because most ligands can be decomposed into smaller fragments, we hypothesised that mapping onto the binding site surface the propensity of binding specific fragments could be used to monitor changes in the overall ability of the site to bind a ligand. This task can be formulated as semantic segmentation, which can now be performed successfully using deep learning methods. Here we introduce the Fragment-Based protein Ensemble semantic Segmentation Tool for Myosin (FragBEST-Myo), a deep learning method based on a 3D U-Net architecture, trained to partition the omecamtiv mecarbil (OM) binding site of cardiac myosin into fragment-specific regions using only local shape and physico-chemical features. The model was trained on labelled Molecular Dynamics (MD) trajectories of OM-bound myosin in both post-rigor (PR) and pre-power-stroke (PPS) states, achieving an accuracy of ~95% and a mean Intersection over Union (mIoU) > 0.75 on unseen trajectories from both states. When applied to apo trajectories, FragBEST-Myo-derived descriptors produced rankings consistent with similarity to holo conformations. Moreover, selecting apo frames based on FragBEST-Myo ranking increased the chance of recovering holo-like OM docking poses relative to randomly chosen control frames, supporting its use as a screening tool for ensemble docking. Beyond frame selection, fragment maps provide a compact representation to assess docking poses and to guide fragment-based design. 
Our proof-of-concept provides a basis for developing future general models applicable to a broader range of proteins and ligands, with the fragment-based formulation offering a natural route to generalisation.
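For readers unfamiliar with the mean Intersection over Union (mIoU) metric quoted above, here is a minimal sketch of how it is commonly computed for a segmentation task. The flat label lists and the choice to skip classes absent from both prediction and ground truth are assumptions for illustration; the abstract does not state the authors' exact averaging convention.

```python
def mean_iou(pred, true, num_classes):
    """Average per-class IoU over classes present in pred or true."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, true) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, true) if p == c or t == c)
        if union > 0:  # skip classes that occur in neither labelling
            ious.append(inter / union)
    if not ious:
        return float("nan")
    return sum(ious) / len(ious)

# Toy labels for six surface points and three hypothetical fragment classes.
true = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, true, num_classes=3)  # per-class IoU: 1/3, 2/3, 1/2
```

Unlike plain accuracy, mIoU weights every class equally, so a small fragment region that is segmented poorly lowers the score as much as a large one.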
Torrisi, M.; Asadollahi, S.; de la Vega de Leon, A.; Wang, K.; Copeland, W.
In recent years, several chemical language models have been developed, inspired by the success of protein language models and advancements in natural language processing. In this study, we explore whether pre-training a chemical language model on billion-scale compound datasets, such as Enamine and ZINC20, can lead to improved compound representation in the drug space. We compare the learned representations of these models with the de facto standard compound representation, and evaluate their potential application in drug discovery and development by benchmarking them on biophysics, physiology, and physical chemistry datasets. Our findings suggest that the conventional masked language modeling approach on these extensive pre-training datasets is insufficient in enhancing compound representations. This highlights the need for additional physicochemical inductive bias in the modeling beyond scaling the dataset size.
Igashov, I.; Jamasb, A. R.; Sadek, A.; Sverrisson, F.; Schneuing, A.; Lio, P.; Blundell, T. L.; Bronstein, M.; Correia, B.
Small molecules have been the preferred modality for drug development and therapeutic interventions. This molecular format presents a number of advantages, e.g. long half-lives and cell permeability, making it possible to access a wide range of therapeutic targets. However, finding small molecules that engage "hard-to-drug" protein targets specifically and potently remains an arduous process, requiring experimental screening of extensive compound libraries to identify candidate leads. The search continues with further optimization of compound leads to meet the required potency and toxicity thresholds for clinical applications. Here, we propose a new computational workflow for high-throughput fragment-based screening and binding affinity prediction where we leverage the available protein-ligand complex structures using a state-of-the-art protein surface embedding framework (dMaSIF). We developed a tool capable of finding suitable ligands and fragments for a given protein pocket solely based on protein surface descriptors that capture chemical and geometric features of the target pocket. The identified fragments can be further combined into novel ligands. Using the structural data, our ligand discovery pipeline learns the signatures of interactions between surface patches and small pharmacophores. On a query target pocket, the algorithm matches known target pockets and returns either potential ligands or identifies multiple ligand fragments in the binding site. Our binding affinity predictor is capable of predicting the affinity of a given protein-ligand pair, requiring only limited information about the ligand pose. This enables screening without the costly step of first docking candidate molecules. Our framework will facilitate the design of ligands based on the target's surface information. It may significantly reduce the experimental screening load and ultimately reveal novel chemical compounds for targeting challenging proteins.
Wang, J.; Dokholyan, N. V.
The design of molecules for flexible protein pockets represents a significant challenge in structure-based drug discovery, as proteins often undergo conformational changes upon ligand binding. While deep learning-based approaches have shown promise in molecular generation, they typically treat protein pockets as rigid structures, limiting their ability to capture the dynamic nature of protein-ligand interactions. Here, we introduce YuelDesign, a novel diffusion-based framework specifically developed to address this challenge. YuelDesign employs a new protein encoding scheme with a fully connected graph representation to encode protein pocket flexibility, a systematic denoising process that refines both atomic properties and coordinates, and a specialized bond reconstruction module tailored for de novo generated molecules. Our results demonstrate that YuelDesign generates molecules with favorable drug-likeness and low synthetic complexity. The generated molecules also exhibit diverse chemical functional groups, including some not even present in the training set. Redocking analysis reveals that the generated molecules exhibit docking energies comparable to native ligands. Additionally, a detailed analysis of the denoising process shows how the model systematically refines molecular structures through atom type transitions, bond dynamics, and conformational adjustments. Overall, YuelDesign presents a versatile framework for generating novel molecules tailored to flexible protein pockets, with promising implications for drug discovery applications.
Wills, S.; Sanchez-Garcia, R.; Roughley, S. D.; Merritt, A.; Hubbard, R. E.; von Delft, F.; Deane, C. M.
The efficiency of fragment-to-lead optimization could be improved by automated workflows for the design of follow-up compounds. Pipelines that are able to fully exploit the interaction opportunities identified from the crystal structures of bound fragments would greatly aid this goal. To do so, these pipelines need to require minimal intervention from the user and be computationally efficient. In this work, we describe an updated version of our fragment merging methodology, which provides several feature enhancements, primarily by expanding the chemical space searched, allowing the identification of more diverse follow-up compounds, thus maximizing the chances of finding successful hits. While the original method focused on finding perfect merges, meaning compounds that directly incorporate substructures from the original fragments, here we expand the search to what we term bioisosteric merges, involving the incorporation of substructures that replicate the pharmacophoric features of the original fragments but may not be exactly identical. Unlike existing pharmacophore and shape-based descriptors used for virtual screening, this approach combines the search for these properties with the incorporation of novelty, which is necessary when searching for ways to link together distinct substructures. Compared with perfect merging, our new approach is able to find compounds that are directly informed by structures within the original fragments but are more chemically diverse. We contrast our approach with the use of a pharmacophore-constrained docking pipeline, run in parallel for select fragment pairs, and show that our method requires between 1.1- and 45.9-fold less computational time for conformer generation per merging hit identified, referring to compounds that show a favourable degree of shape and colour overlap and recapitulation of original fragment interactions. 
Overall, our results show that our method has potential to be used to generate designs inspired by all fragments within a given pocket.
Zhao, H.; Nittinger, E.; Tyrchan, C.
Chemical space exploration has gained significant interest with the increase in available building blocks, which enables the creation of ultra-large virtual libraries containing billions or even trillions of compounds. However, the challenge of selecting the most suitable compounds for synthesis arises, and one such challenge is hit expansion. Recently, Thompson sampling, a probabilistic search approach, has been proposed by Walters et al. to achieve efficiency gains by operating in the reagent space rather than the product space. Here, we aim to address some of its shortcomings and propose optimizations. We introduce a warmup routine to ensure that initial probabilities are set for all reagents with a minimum number of molecules evaluated. Additionally, a roulette wheel selection is proposed with adapted stop criteria to improve sampling efficiency, and belief distributions of reagents are only updated when they appear in new molecules. We demonstrate that a 100% recovery rate can be achieved by sampling 0.1% of the fully enumerated library, showcasing the effectiveness of our proposed optimizations.
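To make the idea of searching in reagent space concrete, the sketch below shows a bare-bones Thompson sampling loop for a two-component reaction. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian belief update, the class and function names, the toy scoring function, and the "higher score is better" convention are all hypothetical, and the warmup routine and roulette wheel selection described in the abstract are not implemented. The one detail taken from the abstract is that beliefs are updated only when a reagent appears in a molecule that has not been evaluated before.

```python
import random

class ReagentBelief:
    """Simple Gaussian belief over scores of products containing one reagent."""
    def __init__(self, prior_mean=0.0, prior_std=1.0):
        self.mean, self.prior_std, self.n = prior_mean, prior_std, 0

    @property
    def std(self):
        # Uncertainty shrinks as more products with this reagent are scored.
        return self.prior_std / (self.n + 1) ** 0.5

    def sample(self, rng):
        return rng.gauss(self.mean, self.std)

    def update(self, score):
        # Running mean; the first observation replaces the prior mean.
        self.n += 1
        self.mean += (score - self.mean) / self.n

def thompson_search(components, score_fn, n_iter, seed=0):
    """components: one list of reagents per reaction component."""
    rng = random.Random(seed)
    beliefs = [[ReagentBelief() for _ in comp] for comp in components]
    evaluated = {}
    for _ in range(n_iter):
        # Draw one sample per reagent and keep the best draw in each component.
        pick = tuple(
            max(range(len(comp)), key=lambda i: b[i].sample(rng))
            for comp, b in zip(components, beliefs)
        )
        if pick in evaluated:
            continue  # beliefs are updated only for new molecules
        score = score_fn(*(comp[i] for comp, i in zip(components, pick)))
        evaluated[pick] = score
        for b, i in zip(beliefs, pick):
            b[i].update(score)
    return evaluated

# Toy library: reagents are numbers and the hypothetical score is their sum.
components = [[0, 1, 2], [0, 1, 2]]
results = thompson_search(components, lambda a, b: a + b, n_iter=200)
```

The loop keeps one belief per reagent rather than one per product, so the number of tracked distributions grows with the sum of the component sizes instead of their product.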
Albanese, S. K.; Chodera, J. D.; Volkamer, A.; Keng, S.; Abel, R.; Wang, L.
Alchemical free energy calculations are now widely used to drive or maintain potency in small molecule lead optimization with a roughly 1 kcal/mol accuracy. Despite this, the potential to use free energy calculations to drive optimization of compound selectivity among two similar targets has been relatively unexplored in published studies. In the most optimistic scenario, the similarity of binding sites might lead to a fortuitous cancellation of errors and allow selectivity to be predicted more accurately than affinity. Here, we assess the accuracy with which selectivity can be predicted in the context of small molecule kinase inhibitors, considering the very similar binding sites of human kinases CDK2 and CDK9, as well as another series of ligands attempting to achieve selectivity between the more distantly related kinases CDK2 and ERK2. Using a Bayesian analysis approach, we separate systematic from statistical error and quantify the correlation in systematic errors between selectivity targets. We find that, in the CDK2/CDK9 case, a high correlation in systematic errors suggests free energy calculations can have significant impact in aiding chemists in achieving selectivity, while in more distantly related kinases (CDK2/ERK2), the correlation in systematic error suggests fortuitous cancellation may even occur between systems that are not as closely related. In both cases, the correlation in systematic error suggests that longer simulations are beneficial to properly balance statistical error with systematic error to take full advantage of the increase in apparent free energy calculation accuracy in selectivity prediction.
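The cancellation-of-errors argument can be illustrated with the standard variance identity for a difference of two correlated quantities. This is a simplified worked example, not the authors' Bayesian model: the symbols $\sigma_1$, $\sigma_2$ (systematic error in each target's predicted binding free energy) and $\rho$ (their correlation) are introduced here for illustration.

```latex
% Selectivity of one ligand between targets 1 and 2:
%   S = \Delta G_1 - \Delta G_2
\begin{align}
  \sigma_S^2 &= \sigma_1^2 + \sigma_2^2 - 2\rho\,\sigma_1\sigma_2 \\
  \intertext{and, when both targets have the same error $\sigma_1 = \sigma_2 = \sigma$,}
  \sigma_S &= \sigma\sqrt{2\,(1-\rho)}
\end{align}
```

Under these assumptions the selectivity error falls below the single-target error $\sigma$ whenever $\rho > 0.5$, and for uncorrelated errors ($\rho = 0$) it grows to $\sqrt{2}\,\sigma$.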
Zhang, R.; Jiang, X.; Cao, D.; Yu, J.; Chen, M.; Fan, Z.; Kong, X.; Xiong, J.; Zhang, Z.; Zhang, W.; Ni, S.; Wang, Y.; Gao, S.; Zheng, M.
Structure-based drug design (SBDD) relies on accurate knowledge of protein structure and ligand-binding conformations. However, most of the static conformations obtained by advanced methods such as structural biology and de novo protein folding algorithms often don't meet the needs for drug design. We introduce PackDock, a flexible docking method that combines "conformation selection" and "induced fit" mechanisms in a two-stage docking pipeline. The core module of this method is PackPocket, which uses a diffusion model to explore the side-chain conformation space in ligand binding pockets, both with and without a ligand. We evaluate our method using several tests that reflect real-world application scenarios. (1) Side-chain packing and re-docking experiments validate the ability of PackDock to predict accurate side-chain conformations and ligand conformations. (2) Cross-docking experiments with apo and non-homologous ligand-induced holo structures align with real docking scenarios, demonstrating PackDock's practical value. (3) Docking experiments with hypothetical models show that PackPocket can potentially conduct SBDD starting from protein sequence information only. Additionally, we found that PackDock can identify key amino acid conformation changes, which may provide insights for lead compound optimization. We demonstrate PackDock can accurately predict the complex conformations in various application scenarios, by combining the conformation selection theory and the induced fit theory, and by using the ability of PackPocket to accurately predict the side chain conformations in the pocket region. We believe this method can improve the usability of existing structures, providing a new perspective for the SBDD community.
Secker, C.; Secker, P.; Yergoez, F.; Celik, M. O.; Chewle, S.; Phuong Nga Le, M.; Masoud, M.; Christgau, S.; Weber, M.; Gorgulla, C.; Nigam, A.; Pollice, R.; Schuette, C.; Fackeldey, K.
The identification of suitable lead molecules in the vast chemical space is a critical and challenging task in drug discovery campaigns. Recently, it has been demonstrated that large-scale virtual screening provides a powerful approach to accelerate the identification of novel drug candidates by screening ever increasing virtual ligand libraries, which have reached magnitudes of > 10^20 compounds. However, this desirable increase in potentially bioactive molecules poses a new challenge as enumerating and virtually screening such huge compound libraries is computationally prohibitive. Consequently, advanced approaches to navigate ultra-large chemical spaces and to identify suitable candidate molecules therein are urgently needed. Here, we present an evolutionary algorithm framework using molecular generative AI, reaction-based substructure searching, and iterative model fine-tuning for a targeted and efficient exploration of chemical fragment spaces. Combining this approach with large-scale virtual screening we are able to identify target-specific candidate molecules within the commercially available Enamine REAL Space (~10^15). We demonstrate the applicability of the approach by successfully identifying and biochemically validating pH-specific ligands of the μ-opioid receptor. Our results demonstrate that integrating generative AI with evolutionary algorithms provides a promising route to explore ultra-large chemical spaces for the discovery of novel, synthetically accessible lead molecules.
Jung, N.; Park, H.; Yang, J.; Seok, C.
Virtual screening has long been a central computational tool for rational ligand discovery, enabling the systematic prioritization of candidate molecules from large chemical libraries. Although docking and related approaches that explicitly account for receptor-ligand interactions have been developed and refined over several decades, achieving both reliable receptor-aware interaction modeling and computational scalability remains an open challenge, particularly for ultra-large chemical spaces. Ligand-based methods are fast and robust but do not explicitly incorporate receptor structure, whereas docking-based approaches model receptor-ligand interactions more directly at substantially higher computational cost. Here, we present G-screen, a freely available and scalable receptor-aware virtual screening framework designed for cases in which a reference protein-ligand complex structure is available. Instead of performing full docking, G-screen rapidly aligns candidate ligands to the reference ligand using a flexible global alignment algorithm (G-align) and evaluates receptor-aware pharmacophore interactions derived from the reference complex, thereby combining the efficiency of ligand-based alignment with explicit atomic-level interaction analysis. Benchmarking on DUD-E, LIT-PCBA, and MUV datasets demonstrates that G-screen achieves competitive discrimination and early enrichment relative to representative ligand-based and docking-based methods, while maintaining millisecond-scale per-molecule runtimes under multi-threaded execution. These results position G-screen as a practical and scalable receptor-aware screening strategy for efficiently filtering large chemical libraries when a reference complex structure is available. Scientific Contribution: We have developed a scalable virtual screening framework for efficiently filtering ultra-large chemical libraries using a flexible global alignment algorithm combined with receptor-aware pharmacophore evaluations. 
Despite explicitly capturing atomic-level interactions, the screening process using this method is highly efficient, maintaining millisecond-scale per-molecule runtimes under parallel execution. It achieves competitive discrimination and early enrichment, successfully bridging the speed of ligand-based approaches with the structural context of traditional docking.
Manrique, P. D.; Leus, I.; Lopez, C. A.; Mehla, J.; Malloci, G.; Gervasoni, S.; Vargiu, A.; Kinthada, R.; Herndon, L.; Hengartner, N.; Walker, J. K.; Rybenkov, V.; Ruggerone, P.; Zgurskaya, H.; Gnanakaran, S.
The ability of Gram-negative pathogens to adapt and protect themselves against antibiotics is a growing threat to public health. The low permeability of the outer membrane (OM) and effective multidrug efflux pumps constitute the two main antibiotic resistance mechanisms. Although much effort has been devoted to discovering new antibiotics that can bypass these defense mechanisms, no new antibiotic classes have been introduced into clinics in the last 35 years. Models that identify specific descriptors of molecular properties and predict the likelihood that a given compound is capable of successfully permeating the OM and inhibiting bacterial growth while avoiding efflux could facilitate the discovery of novel classes of antibiotics. Here we evaluate 174 molecular descriptors of 1260 antimicrobial compounds and study their correlations with antibacterial activity in Gram-negative Pseudomonas aeruginosa. While some of these descriptors are computed using traditional approaches based on the physicochemical properties intrinsic to the compounds, ensemble docking and all-atom molecular dynamics (MD) simulations are used to derive additional bacterium-specific mechanistic properties. Descriptors of compound permeation across the OM were calculated using all-atom MD simulations of the compounds in different subregions of the OM model. Descriptors of interactions with efflux pumps were calculated from ensemble docking of compounds targeting specific binding pockets of MexB, the major efflux transporter of P. aeruginosa. Using these descriptors and the measured antibacterial inhibitory concentrations of compounds, we design and implement a statistical protocol to identify a subset of the molecular properties that are predictive of whether a given compound is a strong or weak permeator across the Gram-negative OM. 
Our results indicate that 88.4% of the compounds that show measurable antibacterial activity follow very consistent rules of permeation, which highlights the critical role that the interaction between the compound and the OM plays in predicting permeation. The remaining 11.6% of the compounds, although less predictive, are characterized by distinctive structural markers that can be used to minimize classification errors. An implementation of the permeation rules and the structural markers uncovered in our study is shown, and it demonstrates the accuracy of our approach in a set of previously unseen compounds. Taken together, our analysis sheds new light on the key molecular properties that drug candidates should have in order to be effective at OM permeation/inhibition of P. aeruginosa, and opens the gate to similar data-driven studies in other Gram-negative pathogens.
Hong, S. H.; Kim, H.; Kang, S.
We propose FragDockRL, a novel molecular design framework that combines building block-based virtual synthesis reactions with reinforcement learning. The method aims to efficiently generate synthesizable molecules by using docking scores as rewards without relying on large molecular databases. Experiments on three targets, VGFR2, FA10, and CSF1R, demonstrate a consistent improvement in the average docking scores of generated molecules as learning progresses. FragDockRL enables molecule generation reflecting realistic synthetic routes, effectively reduces the search space while maintaining chemical diversity, and improves pose prediction accuracy in the tethered docking process by leveraging core structures and 3D coordinate information. Although limitations such as core structure duplication and constraints in fine structural optimization remain, future work will address these issues through improved reaction selection algorithms and multi-objective optimization strategies. FragDockRL offers a molecular design platform that simultaneously considers synthetic feasibility and binding affinity, contributing to efficient hit generation in fragment-based drug discovery.
Lim, J.
HyperLab, developed by HITS, is a web-based, AI-driven drug discovery platform designed to increase research efficiency for experimental drug discovery researchers. The platform features an intuitive user interface and experience (UI/UX), enabling researchers without specialized expertise in AI or computational methods to readily generate essential discovery outcomes. By employing a Structure-Based Drug Discovery (SBDD) methodology, HyperLab streamlines the complete discovery workflow. Its core functionalities include the prediction of ligand-protein structures and binding activities (Hyper Binding), protein structure-based molecular optimization (Hyper Design), structure-activity-relationship (SAR) analysis, screening of extensive chemical libraries ranging from one million to seven trillion compounds (Hyper Screening and Hyper Screening X), prediction of ADMET properties (Hyper ADME/T), and an AI-driven assistant designed to boost researcher productivity and efficiency. In benchmark evaluations, Hyper Binding demonstrated 77% accuracy for binding pose prediction on the PoseBuster v2 benchmark, surpassing conventional docking techniques and closely approaching the 84% accuracy of AlphaFold3, while offering a considerable advantage in computational speed. Furthermore, for binding affinity prediction, Hyper Binding showed superior performance over both deep learning and physics-based docking models on two distinct FEP datasets, achieving Pearson correlation coefficients of 0.70 and 0.53, respectively. For experimental validation, an internal study utilized Hyper Screening to identify top-ranked compounds without any post-analysis or visual inspection by human experts. The screening process was completed in 24 hours. Through experimental validation of top-ranked compounds, five compounds with IC50 values ranging from 70 to 600 nM were identified. 
Hyper Design was employed to generate novel derivatives with improved predicted binding scores and structural novelty. Of five synthesized compounds, in vitro assays confirmed that three demonstrated over 75% inhibition at a 1 μM concentration, with IC50 values in the 200 to 400 nM range. Notably, one compound exhibited activity comparable or superior to the reference compound. HyperLab is therefore positioned to substantially lower the barriers of time, cost, and specialized expertise inherent in modern drug discovery initiatives.
Voitsitskyi, T.; Bdzhola, V.; Stratiichuk, R.; Koleiev, I.; Ostrovsky, Z.; Vozniak, V.; Khropachov, I.; Henitsoi, P.; Popryho, L.; Zhytar, R.; Yesylevskyy, S. O.; Nafiev, A.; Starosyla, S.
This study introduces the PocketCFDM generative diffusion model, aimed at improving the prediction of small molecule poses in protein binding pockets. The model utilizes a novel data augmentation technique, involving the creation of numerous artificial binding pockets that mimic the statistical patterns of non-bond interactions found in actual protein-ligand complexes. An algorithmic method was developed to assess and replicate these interaction patterns in the artificial binding pockets built around small molecule conformers. It is shown that the integration of artificial binding pockets into the training process significantly enhanced the model's performance. Notably, PocketCFDM surpassed DiffDock in terms of non-bond interaction quality, number of steric clashes, and inference speed. Future developments and optimizations of the model are discussed. Availability: The inference code and final model weights of PocketCFDM are accessible publicly via the GitHub repository: https://github.com/vtarasv/pocket-cfdm.git.
Wang, Z.; Zhou, F.; Wang, Z.; Li, Y.-Q.; Wang, S.; Zheng, L.; Li, W.; Peng, X.
Accurate protein-ligand binding poses are a prerequisite for structure-based binding affinity prediction, and also provide the structural basis for in-depth lead optimization in small molecule drug design. Ligand-based modeling approaches primarily extract valuable information from the structural features of small molecules to assess their potential as drug candidates against specific targets. However, it is challenging to provide reasonable predictions of binding poses for different molecules, due to the complexity and diversity of the chemical space of small molecules. Similarity-based molecular alignment techniques can effectively narrow the search range, as structurally similar molecules are likely to have similar binding modes, with higher similarity usually correlating with higher success rates. However, molecular similarity isn't consistently high because molecules often require changes to achieve specific purposes, leading to reduced alignment precision. To address this issue, we propose a new alignment method, Z-align. This method uses topological structural information as a criterion for evaluating similarity, reducing the reliance on molecular fingerprint similarity. Our method has achieved significantly higher success rates than other methods at moderate levels of similarity. Additionally, our approach can comprehensively and flexibly optimize bond lengths and angles of molecules, maintaining high accuracy even when dealing with larger molecules. Consequently, our proposed solution helps in achieving more accurate binding poses in protein-ligand docking problems, facilitating the development of small molecule drugs.
Tian, L.; Huang, H. H.; Martin, E.; Wei, Y.; Xu, S.; Huang, J.
Show abstract
Despite over a half-century of effort by computational chemists, developing accurate empirical QSAR (Quantitative Structure-Activity Relationships) models for predicting bioactivity directly from chemical structure has remained elusive. The difficulties have been especially pronounced for virtual screening, that is, finding new active compounds substantially different from the known chemical matter used to train the models. Recent breakthroughs have been achieved by employing transfer of learning across huge numbers of bioactivity assays, greatly increasing the amount and diversity of chemical and biological information that informs each model. An early example was Profile-QSAR (pQSAR), a 2-level stacked model, where level-2 PLS (Partial Least Squares) models characterize compounds by their profile of bioactivity predictions from individual level-1 random forest regression QSAR models built on up to 10,000+ other assays. This study introduces metaNN, a meta-learner that trains deep neural networks (DNN) for each individual assay, initialised from a well-generalized consensus DNN optimized across all assays. Comparison of the results suggested that while Profile-QSAR and metaNN perform similarly overall, metaNN works slightly better for smaller assays that were well predicted by the consensus DNN, whereas pQSAR struggled more with smaller assays, owing to the large number of level-1 models, but was less sensitive to similarity to an overall consensus. An ensemble average of both methods combined the strengths of each, working better than either alone. The similar performance of the 2 largely orthogonal algorithms raises questions about whether we are approaching a limit of prediction accuracy in transfer learning for this application scenario.
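The ensemble step described above can be pictured as a plain per-compound mean of the two models' predictions. The sketch below is only an illustration under that assumption: the abstract does not say how the average was weighted, and the function names and activity values are invented for the example, not results from the study.

```python
def ensemble_average(pred_a, pred_b):
    """Unweighted per-compound mean of two models' predictions."""
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction lists must have the same length")
    return [(a + b) / 2 for a, b in zip(pred_a, pred_b)]


def mean_squared_error(truth, pred):
    """Mean squared error between measured and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth)


# Hypothetical activities for three compounds (illustrative numbers only).
measured = [6.0, 7.0, 8.0]
model_a = [5.0, 8.0, 8.0]   # stands in for one method's predictions
model_b = [7.0, 6.0, 9.0]   # stands in for the other method's predictions

combined = ensemble_average(model_a, model_b)
for name, pred in [("A", model_a), ("B", model_b), ("A+B", combined)]:
    print(name, round(mean_squared_error(measured, pred), 3))
```

In this toy data the two models err in opposite directions on some compounds, so averaging cancels part of the error; that is the general reason an ensemble of largely orthogonal methods can outperform each one alone.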
Kapoor, K.; Thangapandian, S.; Tajkhorshid, E.
Show abstract
Proteins can sample a broad landscape as they undergo conformational transitions between different functional states. As key players in almost all cellular processes, proteins are important drug targets. Considering the different conformational states of a protein is therefore central to a successful drug-design strategy. Here we introduce a novel docking protocol, termed extended-ensemble docking, for proteins that undergo large-scale (global) conformational changes during their function. In its application to the multidrug ABC transporter P-glycoprotein (Pgp), extensive non-equilibrium molecular dynamics simulations employing system-specific collective variables that capture the alternate access mechanism of Pgp are first used to construct the transition cycle of the transporter. An extended set of conformational states representing the full transition between the inward- and outward-facing states of Pgp is then used to seed high-throughput docking calculations of a set of known substrates, non-substrates, and modulators of the transporter. Large differences are observed in the predicted binding affinities across the conformational ensemble, with compounds showing stronger binding affinities to intermediate conformations than to the starting crystal structure. Hierarchical clustering of the individual binding modes of the different compounds shows that all ligands preferably bind to the large central cavity of the protein, formed at the apex of the transmembrane domain (TMD), whereas only small binding populations are observed in the previously described R and H sites present within the individual TMD leaflets. Based on the results, the central cavity is further divided into two major subsites: the first subsite preferably binds smaller substrates and high-affinity inhibitors, whereas the second shows a preference for larger substrates and low-affinity modulators.
These central sites, along with the low-affinity interaction sites present within the individual TMD leaflets, may respectively correspond to the proposed high- and low-affinity binding sites in Pgp. We propose a further optimization strategy for developing more potent inhibitors of Pgp, based on increasing their specificity to the extended ensemble of the protein instead of a single protein structure, as well as their selectivity for the high-affinity binding site. In contrast to earlier in-silico studies using single static structures of Pgp, our results show better agreement with experimental studies, pointing to the importance of incorporating the global conformational flexibility of proteins in future drug-discovery endeavors.
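The core bookkeeping of docking one ligand against many conformations can be sketched as picking, per ligand, the conformation with the best score. This is a hedged illustration only: the conformation labels and scores are hypothetical, the function name is invented, and the convention that a lower (more negative) score means stronger predicted binding is an assumption, since the abstract does not name the scoring function.

```python
def best_conformation(scores):
    """Return (conformation, score) with the lowest docking score.

    `scores` maps a conformation label to a docking score; lower is assumed
    to mean stronger predicted binding.
    """
    if not scores:
        raise ValueError("no docking scores supplied")
    name = min(scores, key=scores.get)
    return name, scores[name]


# Hypothetical scores for one ligand across an ensemble of conformations.
ligand_scores = {
    "inward_facing_crystal": -7.2,
    "intermediate_1": -9.1,
    "intermediate_2": -8.4,
    "outward_facing": -6.5,
}
print(best_conformation(ligand_scores))
```

Comparing the best ensemble score with the score against the starting structure alone is one simple way to see the effect described above, where intermediate conformations bind more strongly than the crystal structure.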